supports collective training with programs #18392
gavin1332 merged 5 commits into PaddlePaddle:develop
Conversation
a2a7b5d to dbe90e1
test=develop
guru4elephant
left a comment
Please take care of the distributed arguments in ParamAttr
python/paddle/fluid/param_attr.py
Outdated
  gradient_clip=None,
- do_model_average=False):
+ do_model_average=False,
+ distributed=False):
There was a problem hiding this comment.
Do you have an example for distributed=True?
I wonder whether it is possible to infer the value of distributed from the op type.
A solution has been found, so we will remove this attribute from ParamAttr. And as it belongs to the Distributed FC domain, we will handle this parameter in the next PR.
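To make the discussion concrete, here is a minimal stand-in for the ParamAttr change shown in the diff above: a `distributed` flag marking a parameter as sharded across trainers. This mirrors the proposed constructor signature only; it is not Paddle's released API, and the class body here is a plain-Python sketch.

```python
# Plain-Python sketch of the proposed ParamAttr signature from the diff.
# Not Paddle's actual implementation; the `distributed` flag was later dropped.
class ParamAttr:
    def __init__(self, name=None, gradient_clip=None,
                 do_model_average=False, distributed=False):
        self.name = name
        self.gradient_clip = gradient_clip
        self.do_model_average = do_model_average
        # Marks the parameter as split across trainers (model parallel).
        self.distributed = distributed

attr = ParamAttr(name="dist_fc.w", distributed=True)
```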
- def __init__(self):
+ def __init__(self, nrings=2):
      Collective.__init__(self)
There was a problem hiding this comment.
This is the minimum number of parallel communication rings/streams. Since parallelizing collective communication does no harm in GradAllReduce mode, we prefer this value as the default over 1, which means no parallelism at all.
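A minimal sketch of how gradient tensors might be spread over `nrings` parallel communication rings, assuming a simple round-robin policy. The function name, the gradient names, and the assignment policy are all illustrative, not Paddle's actual implementation.

```python
# Hypothetical round-robin assignment of gradient tensors to nrings
# communication rings, so allreduces can overlap on separate streams.
def assign_ring_ids(grad_names, nrings=2):
    """Map each gradient tensor name to a ring id in [0, nrings)."""
    return {name: i % nrings for i, name in enumerate(grad_names)}

rings = assign_ring_ids(["fc_0.w@GRAD", "fc_0.b@GRAD", "fc_1.w@GRAD"], nrings=2)
# consecutive gradients alternate between ring 0 and ring 1
```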
guru4elephant
left a comment
Also, please remove shard_index_op in this PR.
test=develop
test=develop
As shard_index_op also belongs to the Distributed FC domain, we have removed it.
test=develop
static_cast<int>(OpRole::kCollective) | static_cast<int>(OpRole::kBackward),
static_cast<int>(OpRole::kCollective) | static_cast<int>(OpRole::kOptimize),
Op roles will increase the complexity of Graph and Program analysis. I do not recommend adding a new op role.
I agree with you, and I will try to remove the newly added op roles. However, as collective ops exchange data among trainers, their behavior differs more or less from backward and optimize ops; I will create a new PR to discuss this topic if necessary. Thanks.
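For context, the diff above combines roles with bitwise OR, so one op can carry several roles at once. A minimal sketch of that flag pattern, with illustrative numeric values (not Paddle's actual enum values):

```python
# Hypothetical op-role bit flags, mirroring the
# OpRole::kCollective | OpRole::kBackward pattern in the diff.
K_FORWARD, K_BACKWARD, K_OPTIMIZE, K_COLLECTIVE = 1, 2, 4, 8

def has_role(op_role, role):
    """Check whether a combined op_role bitmask contains `role`."""
    return (op_role & role) != 0

# An op tagged as both collective and backward.
collective_backward = K_COLLECTIVE | K_BACKWARD
```

Every new combined role is one more case that passes like memory optimization or fusion must recognize, which is the analysis-complexity concern raised above.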
If the operator in allreduce is sum, does that mean allreduce is used for gradient aggregation? What is the logic here?
Why? This is an unreasonable assumption.
We are trying to introduce a model-parallel strategy to train extremely large classification problems in face recognition, which can have up to 10 million classes, so the size of the last FC parameter exceeds GPU memory. Therefore we have to split the parameter across multiple cards and call collective ops in the forward phase, in addition to the gradient aggregation.
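The sharding arithmetic behind this can be sketched as follows, assuming an even contiguous split of the class dimension across ranks (the helper name and split policy are illustrative, not the actual Distributed FC implementation):

```python
# Hypothetical contiguous sharding of the last FC weight's class dimension.
def shard_range(num_classes, num_ranks, rank):
    """Return the [start, end) class range owned by `rank`."""
    per_rank = (num_classes + num_ranks - 1) // num_ranks  # ceil division
    start = rank * per_rank
    end = min(start + per_rank, num_classes)
    return start, end

# 10 million classes split over 8 cards: 1.25M output columns per card.
first = shard_range(10_000_000, 8, 0)
last = shard_range(10_000_000, 8, 7)
```

Each card then computes logits only for its own class range in the forward pass, which is why collectives are needed before the softmax, not just for gradients.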
Recent research has produced algorithms to accelerate deep learning training, notably LocalSGD, which allreduces and averages the parameters in the optimization phase instead of allreducing the gradients in the backward phase; our assumption is based on this.
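The LocalSGD pattern mentioned above can be sketched as: each worker takes several local SGD steps, then parameters (not gradients) are allreduce-averaged. This is a single-process toy with a stand-in `allreduce_mean`, not a real distributed implementation.

```python
# Toy LocalSGD round: local SGD steps, then parameter averaging.
def local_sgd_round(params, grads_fn, lr, local_steps, allreduce_mean):
    for _ in range(local_steps):
        grads = grads_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    # Synchronize by averaging parameters across workers, not gradients.
    return allreduce_mean(params)

# Single-process stand-in: averaging over one worker is the identity.
result = local_sgd_round([1.0], lambda ps: [2.0 * p for p in ps],
                         lr=0.1, local_steps=2,
                         allreduce_mean=lambda ps: ps)
# 1.0 -> 1.0 - 0.1*2.0 = 0.8 -> 0.8 - 0.1*1.6 = 0.64
```

Because the collective here sums/averages parameters during optimization, an allreduce with a sum operator cannot be assumed to be gradient aggregation.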
guru4elephant
left a comment
LGTM. We should add a backward unit test for the collective ops if possible.
…nalysis test=develop
test=develop